Proposal for DBFS file access API with mocked test suite #3236
Open
danzafar wants to merge 2 commits into databrickslabs:main
Conversation
nfx requested changes on Nov 11, 2024
nfx (Collaborator) left a comment:
Don't depend directly on py4j; see the techniques below. Otherwise the approach is good.
pyproject.toml (Outdated):

      "sqlglot>=25.5.0,<25.30",
-     "astroid>=3.3.1"]
+     "astroid>=3.3.1",
+     "py4j==0.10.9.7"]
nfx (Collaborator): we cannot add py4j as a dependency to UCX, as it would have to be packaged with its transitive dependencies to run the CLI.
    def _jvm(self):
        try:
            _jvm = self._spark._jvm
            self._java_import(_jvm, "org.apache.hadoop.fs.FileSystem")
nfx (Collaborator) suggested a change:

-             self._java_import(_jvm, "org.apache.hadoop.fs.FileSystem")
+             FileSystem = jvm.org.apache.hadoop.fs.FileSystem
use the same technique as here, where you don't have to depend directly on py4j:
https://github.com/databrickslabs/ucx/blame/main/src/databricks/labs/ucx/hive_metastore/locations.py#L397-L401
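The linked technique can be illustrated without a cluster: py4j's JVM view resolves `jvm.org.apache.hadoop.fs.FileSystem` through plain attribute access, so no `java_import` call (and therefore no direct py4j import) is needed in UCX code. A minimal sketch of the pattern, using a `SimpleNamespace` stand-in for `spark._jvm` since no live SparkSession is available here:

```python
# Sketch only: `make_fake_jvm` is a hypothetical stand-in for `spark._jvm`,
# built purely to show the attribute-chaining access pattern the review suggests.
from types import SimpleNamespace


def make_fake_jvm():
    """Mimic spark._jvm exposing org.apache.hadoop.fs.{FileSystem, Path}."""
    file_system = SimpleNamespace(get=lambda conf: "filesystem-instance")
    fs_pkg = SimpleNamespace(FileSystem=file_system, Path=lambda p: f"Path({p})")
    return SimpleNamespace(
        org=SimpleNamespace(apache=SimpleNamespace(hadoop=SimpleNamespace(fs=fs_pkg)))
    )


def get_filesystem(jvm):
    # No `from py4j.java_gateway import java_import` needed: the JVM view
    # resolves Java packages lazily through attribute access.
    FileSystem = jvm.org.apache.hadoop.fs.FileSystem
    Path = jvm.org.apache.hadoop.fs.Path
    return FileSystem, Path


jvm = make_fake_jvm()
FileSystem, Path = get_filesystem(jvm)
print(Path("dbfs:/mnt"))  # Path(dbfs:/mnt)
```

In production code `jvm` would simply be `self._spark._jvm`; the point is that the module never imports py4j itself, so py4j stays out of UCX's dependency list.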
Changes
Why this PR?

Refactoring scan-tables-in-mounts to enhance performance will need to leverage parallelized crawling of dbfs:/mnt/ locations. @nfx suggested using backend Hadoop libraries through py4j via the SparkSession. Since this is a fairly big change, I wanted to make sure we are aligned on the backend file lister before I proceed to modify any existing files.

Enable dbfs file listing

This PR proposes the DbfsFiles class, which uses the SparkSession's Java backend with py4j to leverage the following Hadoop libraries for efficient dbfs:/ (and mount) file access:

- org.apache.hadoop.fs.FileSystem
- org.apache.hadoop.fs.Path

As of now, the only useful method is list_dir, but if the approach is confirmed I plan to add crawling capabilities leveraging databricks.labs.blueprint.parallel.
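The planned crawl could look roughly like the sketch below. The PR intends to use databricks.labs.blueprint.parallel; to stay self-contained this sketch substitutes stdlib `concurrent.futures`, and `list_dir` here is a hypothetical callable returning `(path, is_dir)` tuples, not the PR's actual method:

```python
# Sketch under stated assumptions: `list_dir` is a stand-in for the proposed
# DbfsFiles.list_dir; ThreadPoolExecutor stands in for blueprint's parallel helper.
from concurrent.futures import ThreadPoolExecutor


def crawl(list_dir, root: str, max_workers: int = 8) -> list[str]:
    """Breadth-first crawl: list each directory level in parallel threads."""
    files: list[str] = []
    frontier = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            next_frontier: list[str] = []
            # Listing every directory on the current level concurrently is
            # where the parallel speed-up over a sequential walk comes from.
            for entries in pool.map(list_dir, frontier):
                for path, is_dir in entries:
                    (next_frontier if is_dir else files).append(path)
            frontier = next_frontier
    return files
```

Because dbfs listing is I/O bound, threads (rather than processes) are the natural fit here, which is also the model blueprint's parallel helpers use.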
Test strategy

A testing strategy was painstakingly created to: ... scan-tables-in-mounts. I created a simple trie-based mock file system which can be leveraged for scan-tables-in-mounts functionality after the refactor. This mock file system also has its own unit tests.

Functionality

databricks labs ucx .........

Tests
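A trie-based mock file system like the one the test strategy describes can be sketched as below. The names (`MockFileSystem`, `add_file`, `list_dir`) are illustrative, not the PR's actual API:

```python
# Hedged sketch: a minimal trie-backed mock file system for unit tests,
# assuming the mock only needs to register files and list directories.
class MockFileSystem:
    """Stores paths in a nested-dict trie; a None value marks a file leaf."""

    def __init__(self):
        self._root: dict = {}

    def add_file(self, path: str) -> None:
        # Walk/create trie nodes for each intermediate directory.
        node = self._root
        parts = path.strip("/").split("/")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = None  # leaf = file

    def list_dir(self, path: str) -> list[str]:
        # Descend to the directory node and return its child names sorted.
        node = self._root
        for part in path.strip("/").split("/"):
            if part:
                node = node[part]
        return sorted(node)


fs = MockFileSystem()
fs.add_file("mnt/data/part-0000.parquet")
fs.add_file("mnt/data/part-0001.parquet")
print(fs.list_dir("mnt/data"))  # ['part-0000.parquet', 'part-0001.parquet']
```

Because lookups walk one trie node per path segment, listing stays fast regardless of how many sibling subtrees the mock holds, which matches the crawl-heavy access pattern of scan-tables-in-mounts tests.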